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Abstract 

In lChibI (I1995I) . a method for approximating marginal densities in a Bayesian setting is pro- 
posed, with one proeminent application being the estimation of the nurnber of components in a 



normal mixture. As pointed out in ■Nca l (19991) and iFriihwirth- Schnatteij (|2004 ). the approxima- 
tion often fails short of providing a proper a pproximation to the true marginal densities because 



of the well-known label switching problem ( Celeux et al. . 2000l) . While there exist other alter 



natives to the deriva tion of approximate m argi nal densities, we reconsider the original proposal 
here and show as in Berkhof et al. I 2003f) and Lee et al. ( 2008 ) that it truly approximates the 
marginal densities once the label switching issue has been solved. 

Keywords: Bayesian model choice, conjugate prior, Rao-Blackwellisation, Markov Chain 
Monte Carlo (MCMC). 



1 Introduction 

Model choice is a central issue in mixture model ling because of the nonparametric nature of mixtures 
( Marin et al. . 20051 . Friihwirth-Schnattei . 20061 ). Indeed, while a distribution with a density of the 
form 
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where the densities g are known and the corresponding parameters ^^'s are unknown, is a well- 
defined object (with 9^ = (p^, . . . /uf , . . . , /x^), it occurs that, in most settings, the number 
of components k is uncertain and is an integral part of the inferential goals. This is true for 
classification as well as for estimation purposes, especially because of the weakly informative nature 
of mixtures: due to the representation of those distributions as sums of components samples 
from fk{x\9k) provide relatively little information about each of the components, in the sense that 
there always is a positive probability that no point in the sample has been generated from a 
particular component. 

Evaluating the number k of components from a sample x = (xi, . . . , x„) from ([1]) is therefore a 
quite relevant issue in the setting of mixtures and a standard Bayesian approach is to consider the 
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problem from a model choice perspective, i.e. to consider that each value of k defines a different 
model, with density 



i=l 



and corresponding parameter Of^, and to compute the corresponding Bayes factors 



-^fc,fc+i (^) 



mkix) 



f /fc+i(x|6lfc+i)7rfc+i(6lfc+i)d6lfc+i mfe+i(x) 



for all pairs (fc, k + 1) of interest. Obviously, there exist different Bayesian sol utions for the ap- 



Chen et al.l . l200d . 



proximation of BJ^ , , , (x) a nd this is well-documented in the literature (see, e.g. 
Friihwirth-Schnatteii . |2004| ). One possible solution is to derive the posterior probabilities of the dif- 



fe rent values of k (that ar e prop ortional to the mfc(x)'s) by an reversible jump MCMC algorithm as 



m 



Richardson and GreenI (|l997l l. But we consider however that there is a fundamental inefficiency 
in using a random walk like the reversible jump MCMC algorithm on a structure — the collection 
of mixture distributions with an unknown number of components — made of a rather small number 
of terms (since k is usually bounded): the resulting inherent randomness does not seem pertinent 
in a finite state space. For one thing, the proposed values of the parameters 9^ at each step of 
a reversible jump MCMC algorithm are less likely to be accepted than in a regular Gibbs sam- 
pling scheme because of (a) the introduction of an additional proposal to move between models 
and between the parameters of those models, rather than relying on the exact full conditionals 
of the true target distribution, (b) the comparison not only of values of the parameters within a 
model but in connection with the relative likelihoods of different models which, by its very nature, 
forces the corresponding Markov chain to remain more often in the more probable models and thus 
slows down the exploration of the less probable models, and (c) the lack of connection between the 
adjacent elements of the Markov chain since the parame t er spa ce changes at every step. This is 
of course arguable, as defended in iRichardson and GreenI (jl997l ) who maintain the opposite point 
of view that using a reversible jump algorithm improves the mixing of the Markov chain within 
each model. (This is certainly true from a probabilistic perspective, namely that two consecutive 
values of 6k are less correlated than in a Gibbs scheme because there is an arbitrary large number 
of intermediate simulations between those two values, but this does not answer the criticism that 
a proper exploration of each model, i.e. of each value of k, requires in the end a much larger num- 
ber of simulations than the sum of the numbers of simulations requested by the approximation of 
each posterior distribution 7rfc(0/^.|x), not to mention the addi tional level of complexity in designing 



efficient reversible jumps algorithms, see brooks et all l2nn.i ) 



Exploring each model/case separately by MCMC and then producing an approximation of the 
corresponding marginal densities is therefore more reasonable if those marginals can be correctly 
approximated. Once a sample from the posterior distribution 71^(0 has been produced, there 
are again many alter n atives for approximating the n iargin als mfc(x), as discussed in, for instance, 
Friihwirth-Schnatterl (j2004l ) or IChopin and Robert] (120071] . but the central point of this note is 
to stress the point already made in iBerkhof et al.l (j2003l ) that a proper approximation can be 
found when using a simple correction to Chib's (1995) marginal likelihood approximation, since 
this solution has somehow been overlooked in the literature, maybe due to the original controversy 
surrounding Chib's (1995) proposal. We recall in Section[2]the basis of Chib's (1995) approximation 
and the difficulties surrounding its implementation to the mixture problem, before presenting in 
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Section [3] our correction and demonstrating in Section [J] how this correction recovers the true 
marginal densities. 



2 The original proposal 



Chib's (1995) method for approximating a marginal (likelihood) is a direct application of Bayes' 
theorem: given x ~ fki^l^k) and 0^ ~ T^ki^k)-, we have that 



mfe(x) 



/fc(x|6'fc)7rfc(6'fc) 
7rfc(6'fc|x) 



for all 0's (since both the Ihs and the rhs of this equation are constant in 9). Therefore, if an 
arbitrary value of 6 j, 0* s ay, is selected and if a good approximation to 7r(0|x) can be constructed, 
7r(0|x) say, Chib's ( 19951 ) approximation to the marginal likelihood is 



mfc(x) 



fk{-^\ei)^k{ei) 



X 



(2) 



In the special setting of mixtures of distributions, Chib's (1995) approximation is particu- 
larly attractive as there e xists a natural approximation to 7rfc(0fc|x), based on the Rao-Blackwell 
(jCelfand and Smithl . [l99ol ^ estimate 



X 



1 ^ 
T 



where the z^*^'s are the latent va riables simulated by t he M CMC sampler. (We recall that the natu- 



ral Gibbs sampler in this setting iDiebolt and Robertl . 119901 is based on two steps: (i) the simulation 
of the latent variables z^^ that correspond to the component indicators, conditional on the param- 
eter Ok, and (ii) the simulation of the parameter Ok, conditional on the latent variables Zjk- When 
conjugate priors are used for 0^, step (ii) can be implemented in one block, see lDiebolt and Robert . 
1990, ICasella et all . [2004 ) 



The estimate 7ffc(0^|x) is a parametric unbiased approximation of 7rfc(^^|x) that converges with 
rate 0(\/T). This Rao-Blackwell approximation obviously requires the full conditional density 
-7rfc(0^|x, z) to be available in closed form (constant included), but this is the case when the compo- 
nent densities g{x\^i) are within an exponential family and when conjugate priors on the jiiS are 
used. 

To be efficient, Chib's (1995) method requires (a) a central choice of O"^ but, since in the case 
of mixtures, the likelihood is computable, O'^ can be chosen as the MCMC approximation to the 
MAP or to the ML e stima tor, and (b) a good approximation to 7rfc(0fc|x). This later requirement 
is the core of Neal's (jl999l ) criticism in the case of mixtures: while, at a formal level, 7ffc(0^|x) is 
a converging approximation of ■Kk{Ok\'x) by virtue of the ergodic theorem, this convergence result 
relies on the fact that the chain (z|^ ^) converges to its stationarity distribution. Unfortunately 



in the case of mixtures, as shown in ICeleux et all (|200nl ). the Gibbs sampler ra r ely co nverges in 
essence because of the (lack of) label switching phenomenon (see also Jasra et al. . 20051 ) . In short, 
due to the lack of identifiability of mixture models (since the components remain invariant under 
permutations of their indices), the posterior distribution is generaly multimodal and, in the case of 
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an exchangeable prior, it is also exchangeable. Therefore, when the Gibbs output fails to reproduce 
the exchangeability predicted by the theory, namely when it remains concentrated around one (or a 
sub set) o f the k \ modes of the posterior distribution, the approximation 7ffc(0^|x) is untrustworthy 
andlNeaJ (Il999l ) demonstrated via a n umer ical experiment that ([2]) is significantly different from the 
true value r?ifc(x) in that case. IChibI (|l995l ) tried to overcome this difficulty by using a constrained 
parameter set based on an identifiability constraint, but such constraints are notorious for slowing 
down the corresponding MC MC sampler and, mo re importantly, for failing to isolate a single mode 



of the posterior distribution ( Celeux et al. . 200d ) . 



3 The fix 



There is, however, an easy remedy to this problem, as already demonstrated in iBerkhof et al 
Since, when the prior distribution is exchangeable over the components of the mixture, the 
posterior distribution is also exchangeable, this means that 



o-ee 

for all (j's in (3^, set of all permutations of {1, . . . , k}. (The notation cr{6l) indicates the transform 
of 9'^ where components are switched according to the permutation a.) In other words, the distri- 
bution of interest is invariant over all permutations and the data brings no information about an 
ordering of the components. The lack of symmetry in an approximation 7ffc(^^|x) is therefore purely 
ancillary and integrating out this factor of randomness by recovering the label switching symmetry 
a posteriori can only reduce the variability of the approximation, by a standard Rao-Blackwell 
argument. We thus propose replacing 7ffc(^^|x) in ([2]) above with 



<TeSfc t=i 



x, z 



Note that this solution is t aking advantage of the symmetry predicted by the theory, following the 



general principles stated in iKong et al.l (|2003l ) 



The modified 7ffc(0^|x) is shown (through examples) in the next section to recover the missing 
mass lost in the lack of exploration of the k\ modes of the posterior density, rightly pointed out by 
Neall(ll999l ). When the Gibbs sampler starts exploring more than one mode of the posterior density. 



there is no loss in using the symmetrised estimator 7f/c(0^|x) (except for the additional computing 
time). In the case of "perfect symmetry", both estimators are identical, which is a good indicator 
of proper mixing. In other difference between both estimators points out a lack of mixing, 

at least from the point of view of exchangeability, and it may call for additional simulations with 
different starting points. The major question in such cases is to ascertain whether or not the Gibbs 
sa mpler has cqniplete ly explored at least one major mode of the posterior distribution. As shown 
in iMarin et al.l (j2005l ) , there may also exist secondary modes where a standard Gibbs sampler gets 
trapped. In such occurrences, even a symmetrised estimate of 7rfc(0fc|x) fails to produce a proper 
approximation of myfc(x), but this goes undetected. This is however unrelated with the original 
difficulty of Chib's (1995) approximation and trapping modes can be detected by usi ng tempering 



devices or other simulation algorithms like Population Monte Carlo (jPouc et al.l . 120071 ) . (We indeed 



point out that the approximation ([2]) can also be used in a setup where a sample 6^^ is directly 
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k 


2 


3 


4 


5 


6 


7 8 


mfc(x) 


-115.68 


-103.35 


-102.66 


-101.93 


-102.88 


-105.48 -108.44 



Table 1: Estimations of the marginal likelihoods by the symmetrised Chib's approximation (based 
on 10^ G i bbs it erations and, for A: > 5, 100 permutations selected at random in &k)- {Source: 
Lee et all . l2008l .) 



produced without data augmentation. Once the sample obtained, the z^*^'s can be simulated from 
the full conditional as side products.) 



4 Illustration 



In th is example, we consider the benchmark galaxy dataset (jRoederj . Il992l . iMengersen and Robert! . 
1993), that represents the distribution of the radial speeds of n = 82 galaxies as a mixture of k 
normal distributions with both mean and variance unknown. In this case, label switching mostly 
does not occur. If we compute log 7hk(x.) using only the original estimate, with O^. chosen as the MAP 
estimator, the (logarithm of the) estimated marginal likelihood is 7Tifc(x) = —105.1396 for k = 3 
(based on 10^ sii nulat i ons), while introducing the permutations leads to rhk (x ) = —103.3479. As 
aheady noted bv lNeall (|l999l ). the difference between the orig inal Chib's (|l99,5l ) approximation and 
the true marginal likelihood is close to log(A;!) (only) when the Gibbs sampler remains concentrated 
around a single mode of the posterior distribution. In the current case, we have that —116.3747 -|- 
log(2!) = —115.6816 exactly! (We also checked this numerical value against a brute-force estimate 
obtained by simulating from the prior and averaging the likelihood, up to fourth digit agreement.) 
A similar result holds for k = 3, with -105.1396 log(3!) = -103.3479. Both |NeaJ \l99^ ) and 
Friihwirth-Schnatter ( 20041 ) also pointed out that the log(A:!) difference was unlikely to hold for 
larger values of k as the modes were getting less separated on the posterior surface and thus the 
Gibbs sampler was more likely to explore in parts several modes. For A; = 4, we get for instance 
that the original Chib's ( 19951 ) approximation is —104.1936, while the average over permutations 
gives —102.6642. Similarly, for k = 5, the difference between —103.91 and —101.93 is less than 
log(5!). The log(A;!) difference cannot therefore be used as a direct correction for Chib's ()1995l ) 
approximation because of this difficulty in controlling the amount of overlap. But it is altogether 
unnecessary since using the permutation average resolves the difficulty. Table [1] shows that the 
prefered value of k for the galaxy dataset and the current choice of prior distribution is = 5. 

When the number of components k grows too large for all permutations in 6^ to be considered 
in the average, a (random) subsample of permutations can be simulated to keep the computing 
time to a reas onable level when ke eping the identity as one of the permutations, as in Table [T] for 
A; = 6, 7. (See iBerkhof et al.l . |2003| for another solution.) Note also that the discrepancy between 
the original Chib's (119951 ) approximation and the average over permutations is a good indicator of 
the mixing properties of the Markov chain, if a further convergence indicator is requested. 
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